34 research outputs found

Information-theoretic causal inference of lexical flow

Author: Dellert Johannes
Publication venue: Language Science Press
Publication date: 01/01/2019

This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.

OAPEN Library

Institutional Repository of the Freie Universität Berlin

ZENODO

Publikationsserver der Universität Tübingen

Language Science Press

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Directory of Open Access Books (DOAB)

Developing a TT-MCTAG for German with an RCG-based parser

Author: Dellert Johannes
Kallmeyer Laura
Lichte Timm
Maier Wolfgang
Parmentier Yannick
Publication date: 01/01/2008

Developing linguistic resources, in particular grammars, is known to be a complex task in itself, owing to (among other things) redundancy and consistency issues. Furthermore, some languages prove hard to describe because of specific characteristics, e.g. free word order in German. In this context, we present (i) a framework for describing tree-based grammars, and (ii) an actual fragment of a core multicomponent tree-adjoining grammar with tree tuples (TT-MCTAG) for German developed using this framework. This framework combines a metagrammar compiler and a parser based on range concatenation grammar (RCG) to check the consistency and the correctness of the grammar, respectively. The German grammar being developed within this framework already deals with a wide range of scrambling and extraction phenomena.

CiteSeerX

Hochschulschriftenserver - Universität Frankfurt am Main

TuLiPA : a syntax-semantics parsing environment for mildly context-sensitive formalisms

Author: Dellert Johannes
Kallmeyer Laura
Lichte Timm
Maier Wolfgang
Parmentier Yannick
Publication date: 01/01/2008

In this paper we present a parsing architecture that allows processing of different mildly context-sensitive formalisms, in particular Tree-Adjoining Grammar (TAG), Multi-Component Tree-Adjoining Grammar with Tree Tuples (TT-MCTAG) and simple Range Concatenation Grammar (RCG). Furthermore, for tree-based grammars, the parser computes not only syntactic analyses but also the corresponding semantic representations.

CiteSeerX

INRIA a CCSD electronic archive server

Hochschulschriftenserver - Universität Frankfurt am Main

TuLiPA : towards a multi-formalism parsing environment for grammar engineering

Author: Dellert Johannes
Evang Kilian
Kallmeyer Laura
Lichte Timm
Maier Wolfgang
Parmentier Yannick
Publication date: 01/01/2008

In this paper, we present an open-source parsing environment (Tübingen Linguistic Parsing Architecture, TuLiPA) which uses Range Concatenation Grammar (RCG) as a pivot formalism, thus opening the way to the parsing of several mildly context-sensitive formalisms. This environment currently supports tree-based grammars (namely Tree-Adjoining Grammars (TAG) and Multi-Component Tree-Adjoining Grammars with Tree Tuples (TT-MCTAG)) and allows computation not only of syntactic structures, but also of the corresponding semantic representations. It is used for the development of a tree-based grammar for German.

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Hochschulschriftenserver - Universität Frankfurt am Main

Evaluating the Potential of a Large-Scale Polysemy Network as a Model of Plausible Semantic Shifts

Author: Dellert Johannes
Münch Alla
Publication venue: Universität Tübingen
Publication date: 01/01/2015

We present a very large network of crosslinguistic polysemies, and compare the notion of semantic relatedness it encodes to the catalogue of semantic shifts maintained by the Russian Academy of Sciences. We separately evaluate all types of semantic shifts featured in the catalogue, including shifts occurring during semantic evolution, during borrowing, and during morphological derivation. The comparison shows that over one third of the attested semantic shifts take place between close neighbors in the network. This can be considered strong evidence for the usefulness of polysemy networks in modelling most types of lexical change, making them a valuable resource e.g. for semantic reconstruction or future automatization of cognate detection. We also show that the semantic shifts which occur during morphological derivation form a divergent class, and might need to be modelled separately.

Publikationsserver der Universität Tübingen

Masking Treebanks for the Free Distribution of Linguistic Resources and Other Applications

Author: Dellert Johannes
Rehm Georg
Witt Andreas
Zinsmeister Heike
Publication date: 30/11/2007

Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 127-138. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/4476

DSpace at Tartu University Library

Corpus Masking: Legally Bypassing Licensing Restrictions for the Free Distribution of Text Collections

Author: Dellert Johannes
Rehm Georg
Witt Andreas
Zinsmeister Heike
Publication venue: Urbana-Champaign : University of Illinois
Publication date: 22/12/2015

Publikationsserver des Instituts für Deutsche Sprache

Using computational criteria to extract large Swadesh lists for lexicostatistics

Author: Buch Armin
Dellert Johannes
Publication venue: Universität Tübingen
Publication date: 01/01/2016

We propose a new method for empirically determining lists of basic concepts for the purpose of compiling extensive lexicostatistical databases. The idea is to approximate a notion of “swadeshness” formally and reproducibly without expert knowledge or bias, and to be able to rank any number of concepts given enough data. Unlike previous approaches, our procedure indirectly measures both the stability of concepts against lexical replacement and their proneness to phenomena such as onomatopoeia and extensive borrowing. The method provides a fully automated way to generate customized Swadesh lists of any desired length, possibly adapted to a given geographical region. We apply the method to a large lexical database of Northern Eurasia, deriving a swadeshness ranking for more than 5,000 concepts expressed by German lemmas. We evaluate this ranking against existing shorter lists of basic concepts to validate the method, and give an English version of the 300 top concepts according to this ranking.

Publikationsserver der Universität Tübingen